library(tidyverse)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(forcats)
options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())Linear models are the simplest functional form to understand. They adopt a generic form
\[Y = \beta_0 + \beta_{1}X\]
where \(y\) is the outcome of interest, \(x\) is the explanatory or predictor variable, and \(\beta_0\) and \(\beta_1\) are parameters that vary to capture different patterns. In algebraic terms, \(\beta_0\) is the intercept and \(\beta_1\) the slope for the linear equation. Given the empirical values you have for \(x\) and \(y\), you generate a fitted model that finds the values for the parameters that best fit the data.
ggplot(sim1, aes(x, y)) +
geom_point()This looks like a linear relationship. We could randomly generate parameters for the formula \(y = \beta_0 + \beta_1 * x\) to try and explain or predict the relationship between \(x\) and \(y\):
models <- tibble(
a1 = runif(250, -20, 40),
a2 = runif(250, -5, 5)
)
ggplot(sim1, aes(x, y)) +
geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
geom_point()But obviously some parameters are better than others. We need a definition that can be used to differentiate good parameters from bad parameters.
One approach widely used is called least squares - it means that the overall solution minimizes the sum of the squares of the errors made in the results of every single equation. The errors are simply the difference between the actual values for \(y\) and the predicted values for \(y\) (also known as the residuals).
dist1 <- sim1 %>%
mutate(
dodge = rep(c(-1, 0, 1) / 20, 10),
x1 = x + dodge,
pred = 7 + x1 * 1.5
)
ggplot(dist1, aes(x1, y)) +
geom_abline(intercept = 7, slope = 1.5, color = "grey40") +
geom_point(color = "grey40") +
geom_linerange(aes(ymin = y, ymax = pred), color = "#3366FF")To estimate a linear regression model in R, we use the lm() function. The lm() function takes two parameters. The first is a formula specifying the equation to be estimated (lm() translates y ~ x into \(y = \beta_0 + \beta_1 * x\)). The second is the data frame containing the variables:
sim1_mod <- lm(y ~ x, data = sim1)We can use the summary() function to examine key model components, including parameter estimates, standard errors, and model goodness-of-fit statistics.
summary(sim1_mod)##
## Call:
## lm(formula = y ~ x, data = sim1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.147 -1.520 0.133 1.467 4.652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.221 0.869 4.86 4.1e-05 ***
## x 2.052 0.140 14.65 1.2e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.2 on 28 degrees of freedom
## Multiple R-squared: 0.885, Adjusted R-squared: 0.88
## F-statistic: 215 on 1 and 28 DF, p-value: 1.17e-14
The resulting line from this regression model looks like:
dist2 <- sim1 %>%
add_predictions(sim1_mod) %>%
mutate(
dodge = rep(c(-1, 0, 1) / 20, 10),
x1 = x + dodge
)
ggplot(dist2, aes(x1, y)) +
geom_smooth(method = "lm", color = "grey40") +
geom_point(color = "grey40") +
geom_linerange(aes(ymin = y, ymax = pred), color = "#3366FF")The blue lines identify the residuals, the error for an individual observation \(e_i = y_i - \hat{y}_i\), or the difference between the actual and predicted value for the outcome of interest. Residuals can be very useful for diagnostic tests with linear regression. The distribution of residuals can tell us many things about our regression model. For instance, consider the linear relationship between displ and hwy in the mpg dataset:
# estimate linear model
mpg_lin <- lm(hwy ~ displ, data = mpg)
# plot the model using geom_smooth()
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)Not a great model (the relationship isn’t strictly linear - we’ll get to that more in a minute). But sometimes your model is better at predicting some types of observations better than others. For instance, we also know which cars are front wheel vs. rear wheel vs. four wheel drive:
ggplot(mpg, aes(displ, hwy)) +
geom_point(aes(color = drv)) +
geom_smooth(method = "lm", se = FALSE)Could our model be more accurate at predicting highway mileage for one of these three categories? We can plot the distributions of the residuals from our original model to find out:
mpg %>%
add_residuals(mpg_lin) %>%
ggplot(aes(resid, color = drv)) +
geom_density()If the model performed equally well for all three categories, the distributions should look similar to one another (ideally normally distributed centered around 0). It appears here that there are substantial differences in the distributions, suggesting perhaps we should include drv as an additional variable in our regression model in order to improve predictive accuracy:
mpg_drv <- lm(hwy ~ displ + drv, data = mpg)
mpg %>%
add_residuals(mpg_drv) %>%
ggplot(aes(resid, color = drv)) +
geom_density()By doing so, the residuals now are more similarly distributed.
Linear regression assumes the relationship between the predictors and the response is a straight line. If the true relationship is otherwise, then we cannot generate accurate inferences from the model.
ggplot(mpg, aes(displ, hwy)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)Doesn’t look linear. We could just look at this graph and make that conclusion (a very basic visualization), but what happens once we estimate a model with more than a single predictor? Instead we can use residual plots to identify non-linearity. Recall that the residual is the error for an individual observation \(e_i = y_i - \hat{y}_i\), or the difference between the actual and predicted value for the outcome of interest. Residual plots graph the relationship between the fitted values and the residuals. Ideally there should be no discernable pattern in the graph. If there is a pattern, then this indicates a problem with some aspect of the linear model.
# estimate models
mpg_lin <- lm(hwy ~ displ, data = mpg)
mpg_quad <- lm(hwy ~ poly(displ, 2, raw = TRUE), data = mpg)
mpg %>%
add_predictions(mpg_lin) %>%
add_residuals(mpg_lin) %>%
ggplot(aes(pred, resid)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(alpha = .2) +
geom_smooth(se = FALSE) +
labs(title = "Residual plot for linear fit")mpg %>%
add_predictions(mpg_quad) %>%
add_residuals(mpg_quad) %>%
ggplot(aes(pred, resid)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(alpha = .2) +
geom_smooth(se = FALSE) +
labs(title = "Residual plot for quadratic fit")Another assumption of linear regression is that the error terms \(\epsilon_i\) have a constant variance, \(\text{Var}(\epsilon_i) = \sigma^2\). This is called homoscedasticity. Estimates of standard errors rely on this assumption. If the variances of the error terms are non-constant (aka heteroscedastic), our estimates of the standard errors will be inaccurate - they will either be inflated or deflated, leading to incorrect inferences about the statistical significance of predictor variables.
We can uncover homo- or heteroscedasticity through the use of the residual plot. Below is data generated from the process:
\[Y = 2 + 3X + \epsilon\]
where \(\epsilon\) is random error distributed normally \(N(0,1)\).
sim_homo <- data_frame(x = runif(1000, 0, 10),
y = 2 + 3 * x + rnorm(1000, 0, 1))
sim_homo_mod <- glm(y ~ x, data = sim_homo)
sim_homo %>%
add_predictions(sim_homo_mod) %>%
add_residuals(sim_homo_mod) %>%
ggplot(aes(pred, resid)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(alpha = .2) +
geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
labs(title = "Homoscedastic variance of error terms",
x = "Predicted values",
y = "Residuals")Compare this to a linear model fit to the data generating process:
\[Y = 2 + 3X + \epsilon\]
where \(\epsilon\) is random error distributed normally \(N(0,\frac{X}{2})\). Note that the variance for the error term of each observation \(\epsilon_i\) is not constant, and is itself a function of \(X\).
sim_hetero <- data_frame(x = runif(1000, 0, 10),
y = 2 + 3 * x + rnorm(1000, 0, (x / 2)))
sim_hetero_mod <- glm(y ~ x, data = sim_hetero)
sim_hetero %>%
add_predictions(sim_hetero_mod) %>%
add_residuals(sim_hetero_mod) %>%
ggplot(aes(pred, resid)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(alpha = .2) +
geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
labs(title = "Heteroscedastic variance of error terms",
x = "Predicted values",
y = "Residuals")We see a distinct funnel-shape to the relationship between the predicted values and the residuals. This is because by assuming the variance is constant, we substantially over or underestimate the actual response \(Y_i\) as \(X_i\) increases.
An outlier is an observation for which the predicted value \(\hat{y}_i\) is far from the actual value \(y_i\). Sometimes outliers are simply coding errors in the original dataset, but sometimes they are extreme or unusual values that were not generated by the same data generating process as the remaining dataset. Detecting outliers is the first step to deciding how to handle them (i.e. keep or remove them from the analysis).
We can use a few different visualizations to detect outliers. In a bivariate linear regression model, simply plot the variables against one another and draw the best-fit line:
sim <- data_frame(x = rnorm(20),
y = -2 + x + rnorm(20)) %>%
mutate(y = ifelse(row_number(x) == 15, y + 10, y),
outlier = ifelse(row_number(x) == 15, TRUE, FALSE))
ggplot(sim, aes(x, y)) +
geom_point(aes(color = outlier)) +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(data = filter(sim, outlier == FALSE),
method = "lm", se = FALSE, linetype = 2) +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")Or we can draw another residual plot (either raw or standardized):
sim_lm <- lm(y ~ x, data = sim)
sim %>%
add_predictions(sim_lm) %>%
add_residuals(sim_lm) %>%
ggplot(aes(pred, resid)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(aes(color = outlier)) +
labs(x = "Fitted values",
y = "Residuals") +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")augment(sim_lm) %>%
left_join(sim) %>%
ggplot(aes(.fitted, .std.resid)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(aes(color = outlier)) +
labs(x = "Fitted values",
y = "Standardized residuals") +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")High-leverage points are observations with have strong effects on the coefficient estimates in a linear model. Influence is a function of leverage and discrepancy:
\[\text{Influence} = \text{Leverage} \times \text{Discrepancy}\]
Various statistical measures exist for quantifying leverage and discrepancy. Leverage is frequently defined by the leverage statistic (also known as the hat value):
\[h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}\]
Residuals are commonly used to measure discrepancy (either raw, standardized, or studentized). Cook’s distance (or Cook’s D) combines these two measures to calculate an observation’s leverage:
\[D_i = \frac{\tilde{u}_i^2}{K} \times \frac{h_i}{1 - h_i}\]
where \(\tilde{u}_i^2\) is the squared standardized residual, \(K\) is the number of parameters in the model, and \(\frac{h_i}{1 - h_i}\) is the hat value.
Where is the visualization? By combining all three variables into a “bubble plot”, we can visualize all three variables simultaneously. For example, here are the results of a basic model of the number of federal laws struck down by the U.S. Supreme Court in each Congress, based on:
library(haven)
dahl <- read_dta("data/LittleDahl.dta")
dahl_mod <-lm(nulls ~ age + tenure + unified, data = dahl)
dahl_augment <- dahl %>%
mutate(hat = hatvalues(dahl_mod),
student = rstudent(dahl_mod),
cooksd = cooks.distance(dahl_mod))
# use size
ggplot(dahl_augment, aes(hat, student)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point(aes(size = cooksd), shape = 1) +
geom_text(data = dahl_augment %>%
arrange(-cooksd) %>%
slice(1:10),
aes(label = Congress)) +
scale_size_continuous(range = c(1, 20)) +
labs(x = "Leverage",
y = "Studentized residual") +
theme(legend.position = "none")# use color and geom_text_repel
ggplot(dahl_augment, aes(hat, student, color = cooksd)) +
geom_hline(yintercept = 0, linetype = 2) +
geom_point() +
ggrepel::geom_text_repel(data = dahl_augment %>%
arrange(-cooksd) %>%
slice(1:10),
aes(label = Congress)) +
labs(x = "Leverage",
y = "Studentized residual") +
theme(legend.position = "none")The bubble plot tells us several things:
What is the relationship between happiness and gender? We could identify this in several different contingency tables, depending on the probability distribution on which we wish to focus:
# Mosaic plot of happiness and education
library(productplots)
data("happy")
happy <- happy %>%
na.omit# contingency table
library(descr)
crosstab(happy$happy, happy$sex, prop.t = TRUE, plot = FALSE)## Cell Contents
## |-------------------------|
## | Count |
## | Total Percent |
## |-------------------------|
##
## =======================================
## happy$sex
## happy$happy male female Total
## ---------------------------------------
## not too happy 1904 2460 4364
## 5.5% 7.1%
## ---------------------------------------
## pretty happy 8760 10539 19299
## 25.2% 30.3%
## ---------------------------------------
## very happy 4833 6327 11160
## 13.9% 18.2%
## ---------------------------------------
## Total 15497 19326 34823
## =======================================
crosstab(happy$happy, happy$sex, prop.r = TRUE, plot = FALSE)## Cell Contents
## |-------------------------|
## | Count |
## | Row Percent |
## |-------------------------|
##
## =======================================
## happy$sex
## happy$happy male female Total
## ---------------------------------------
## not too happy 1904 2460 4364
## 43.6% 56.4% 12.5%
## ---------------------------------------
## pretty happy 8760 10539 19299
## 45.4% 54.6% 55.4%
## ---------------------------------------
## very happy 4833 6327 11160
## 43.3% 56.7% 32.0%
## ---------------------------------------
## Total 15497 19326 34823
## =======================================
crosstab(happy$happy, happy$sex, prop.c = TRUE, plot = FALSE)## Cell Contents
## |-------------------------|
## | Count |
## | Column Percent |
## |-------------------------|
##
## =======================================
## happy$sex
## happy$happy male female Total
## ---------------------------------------
## not too happy 1904 2460 4364
## 12.3% 12.7%
## ---------------------------------------
## pretty happy 8760 10539 19299
## 56.5% 54.5%
## ---------------------------------------
## very happy 4833 6327 11160
## 31.2% 32.7%
## ---------------------------------------
## Total 15497 19326 34823
## 44.5% 55.5%
## =======================================
crosstab(happy$happy, happy$sex, prop.c = TRUE, prop.r = TRUE, plot = FALSE)## Cell Contents
## |-------------------------|
## | Count |
## | Row Percent |
## | Column Percent |
## |-------------------------|
##
## =======================================
## happy$sex
## happy$happy male female Total
## ---------------------------------------
## not too happy 1904 2460 4364
## 43.6% 56.4% 12.5%
## 12.3% 12.7%
## ---------------------------------------
## pretty happy 8760 10539 19299
## 45.4% 54.6% 55.4%
## 56.5% 54.5%
## ---------------------------------------
## very happy 4833 6327 11160
## 43.3% 56.7% 32.0%
## 31.2% 32.7%
## ---------------------------------------
## Total 15497 19326 34823
## 44.5% 55.5%
## =======================================
Each of the contingency tables encourages a different type of comparison, therefore the author has to decide in advance which comparison is most important and include the appropriate table. Alternatively, we can visualize this information using a mosaic plot, whereby the area of each rectangle is proportional to the number of observations falling into the respective contengencies.
There are a few different packages available for drawing mosaic plots in R.
graphics::mosaicplot()mosaicplot(~ sex + happy, data = happy)vcd::mosaic()library(vcd)
mosaic(~ happy + sex, happy)productplots::prodplot()ggplot2# mosaic plot using productplots
prodplot(happy, ~ happy + sex)# add color
prodplot(happy, ~ happy + sex) +
aes(fill = happy) +
theme(panel.grid = element_blank())prodplot(happy, ~ happy + marital) +
aes(fill = happy) +
theme(legend.position = "none") +
theme(panel.grid = element_blank())Notice that the mosaic plot is very similar to a proportional bar chart:
ggplot(happy, aes(marital, fill = happy)) +
geom_bar(position = "fill")However unlike a proportional bar chart, the bar widths are constant and therefore we do not know what proportion of individuals in the survey are married vs. never married, or any other similar comparison.
library(ISLR)
OJ_sum <- OJ %>%
select(ends_with("MM"), ends_with("CH")) %>%
gather(var, value) %>%
group_by(var) %>%
summarize(mean = mean(value),
sd = sd(value),
min = min(value),
max = max(value),
n = n())
# print the table
kable(OJ_sum)| var | mean | sd | min | max | n |
|---|---|---|---|---|---|
| DiscCH | 0.052 | 0.117 | 0.00 | 0.500 | 1070 |
| DiscMM | 0.123 | 0.214 | 0.00 | 0.800 | 1070 |
| LoyalCH | 0.566 | 0.308 | 0.00 | 1.000 | 1070 |
| PctDiscCH | 0.027 | 0.062 | 0.00 | 0.253 | 1070 |
| PctDiscMM | 0.059 | 0.102 | 0.00 | 0.402 | 1070 |
| PriceCH | 1.867 | 0.102 | 1.69 | 2.090 | 1070 |
| PriceMM | 2.085 | 0.134 | 1.69 | 2.290 | 1070 |
| SalePriceCH | 1.816 | 0.143 | 1.39 | 2.090 | 1070 |
| SalePriceMM | 1.962 | 0.253 | 1.19 | 2.290 | 1070 |
| SpecialCH | 0.148 | 0.355 | 0.00 | 1.000 | 1070 |
| SpecialMM | 0.162 | 0.368 | 0.00 | 1.000 | 1070 |
# plot using a single dot plot
ggplot(OJ_sum, aes(x = fct_reorder(var, mean), y = mean)) +
geom_linerange(aes(ymin = mean - 2 * sd,
ymax = mean + 2 * sd),
linetype = 2,
size = .25) +
geom_linerange(aes(ymin = mean - sd,
ymax = mean + sd),
size = 1) +
geom_point() +
coord_flip() +
labs(x = NULL,
y = NULL)# dodge based on OJ brand
OJ_sum %>%
separate(var, into = c("var", "brand"), -3, remove = TRUE) %>%
ggplot(aes(x = fct_reorder(var, mean), y = mean, color = brand)) +
geom_linerange(aes(ymin = mean - 2 * sd,
ymax = mean + 2 * sd),
linetype = 2,
size = .25,
position = position_dodge(width = 0.5)) +
geom_linerange(aes(ymin = mean - sd,
ymax = mean + sd),
size = 1,
position = position_dodge(width = 0.5)) +
geom_point(position = position_dodge(width = 0.5)) +
coord_flip() +
labs(x = NULL,
y = NULL,
color = "Brand")# facet based on OJ brand
OJ_sum %>%
separate(var, into = c("var", "brand"), -3, remove = TRUE) %>%
ggplot(aes(x = fct_reorder(var, mean), y = mean)) +
facet_grid(. ~ brand) +
geom_linerange(aes(ymin = mean - 2 * sd,
ymax = mean + 2 * sd),
linetype = 2,
size = .25) +
geom_linerange(aes(ymin = mean - sd,
ymax = mean + sd),
size = 1) +
geom_point() +
coord_flip() +
labs(x = NULL,
y = NULL,
color = "Brand")library(coefplot)
model <- lm(hwy ~ displ + I(displ^2), data = mpg)
# in a table
tidy(model) %>%
kable| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 49.24 | 1.858 | 26.51 | 0 |
| displ | -11.76 | 1.073 | -10.96 | 0 |
| I(displ^2) | 1.09 | 0.141 | 7.77 | 0 |
# in a plot
coefplot(model)displ <- lm(hwy ~ displ + I(displ^2), data = mpg)
displ_cyl <- lm(hwy ~ displ + I(displ^2) + cyl, data = mpg)
displ_cyl_drv <- lm(hwy ~ displ + I(displ^2) + cyl + drv, data = mpg)
displ_cyl_drv_class <- lm(hwy ~ displ + I(displ^2) + cyl + drv + class, data = mpg)
# in a table
library(stargazer)
stargazer(displ, displ_cyl, displ_cyl_drv, displ_cyl_drv_class,
type = "html",
title = "Regression models on highway fuel efficiency",
covariate.labels = c("Engine displacement", "$\\text{Engine displacement}\\^2$",
"Number of cylinders", "Front wheel drive",
"Rear wheel drive", "Compact", "Midsize",
"Minivan", "Pickup", "Subcompact", "SUV",
"Constant"),
dep.var.caption = "",
dep.var.labels.include = FALSE,
align =TRUE)| (1) | (2) | (3) | (4) | |
| Engine displacement | -11.800*** | -11.200*** | -7.150*** | -3.710*** |
| (1.070) | (1.410) | (1.220) | (1.280) | |
| Engine displacement^2 | 1.090*** | 1.060*** | 0.686*** | 0.381*** |
| (0.141) | (0.153) | (0.130) | (0.135) | |
| Number of cylinders | -0.264 | -0.746** | -1.040*** | |
| (0.411) | (0.343) | (0.315) | ||
| Front wheel drive | 4.520*** | 2.770*** | ||
| (0.496) | (0.619) | |||
| Rear wheel drive | 4.180*** | 1.170 | ||
| (0.686) | (0.734) | |||
| Compact | -2.780* | |||
| (1.510) | ||||
| Midsize | -2.630* | |||
| (1.550) | ||||
| Minivan | -6.500*** | |||
| (1.720) | ||||
| Pickup | -7.380*** | |||
| (1.470) | ||||
| Subcompact | -2.230 | |||
| (1.470) | ||||
| SUV | -6.560*** | |||
| (1.370) | ||||
| Constant | 49.200*** | 49.300*** | 40.800*** | 40.300*** |
| (1.860) | (1.860) | (1.750) | (2.050) | |
| Observations | 234 | 234 | 234 | 234 |
| R2 | 0.672 | 0.673 | 0.782 | 0.838 |
| Adjusted R2 | 0.670 | 0.669 | 0.778 | 0.830 |
| Residual Std. Error | 3.420 (df = 231) | 3.430 (df = 230) | 2.810 (df = 228) | 2.450 (df = 222) |
| F Statistic | 237.000*** (df = 2; 231) | 158.000*** (df = 3; 230) | 164.000*** (df = 5; 228) | 105.000*** (df = 11; 222) |
| Note: | p<0.1; p<0.05; p<0.01 | |||
# in a plot
multiplot(displ, displ_cyl, displ_cyl_drv, displ_cyl_drv_class,
title = "Regression models on highway fuel efficiency",
xlab = NULL,
ylab = NULL,
newNames = c("displ" = "Engine displacement",
"I(displ^2)" = "Engine displacement^2",
"cyl" = "Number of cylinders",
"drvf" = "Front wheel drive",
"drvr" = "Rear wheel drive",
"classcompact" = "Compact",
"classmidsize" = "Midsize",
"classminivan" = "Minivan",
"classpickup" = "Pickup",
"classsubcompact" = "Subcompact",
"classsuv" = "SUV")) +
scale_color_discrete(labels = c("Model 1", "Model 2", "Model 3", "Model 4"))multiplot(displ, displ_cyl, displ_cyl_drv, displ_cyl_drv_class,
single = FALSE,
names = c("Model 1", "Model 2", "Model 3", "Model 4"),
title = "Regression models on highway fuel efficiency",
xlab = NULL,
ylab = NULL,
newNames = c("displ" = "Engine displacement",
"I(displ^2)" = "Engine displacement^2",
"cyl" = "Number of cylinders",
"drvf" = "Front wheel drive",
"drvr" = "Rear wheel drive",
"classcompact" = "Compact",
"classmidsize" = "Midsize",
"classminivan" = "Minivan",
"classpickup" = "Pickup",
"classsubcompact" = "Subcompact",
"classsuv" = "SUV")) +
theme(legend.position = "none")Visualizations are great tools for presenting and interpreting results from other regression-based models that do not use continuous dependent variables. For instance, consider the Titanic:
library(titanic)
titanic <- titanic_train %>%
as_tibble()
titanic %>%
head() %>%
knitr::kable()| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | S | |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.28 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.92 | S | |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.10 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | S | |
| 6 | 0 | 3 | Moran, Mr. James | male | NA | 0 | 0 | 330877 | 8.46 | Q |
If we estimate a logistic regression model (or models) of survival on the Titanic, the resulting parameters are very difficult to interpret because they are expressed in terms of log-odds, not something intuitive such as odds or probabilities. For example, a model based on age looks like the following:
survive_age <- glm(Survived ~ Age, data = titanic, family = binomial)
tidy(survive_age)## term estimate std.error statistic p.value
## 1 (Intercept) -0.0567 0.17358 -0.327 0.7438
## 2 Age -0.0110 0.00533 -2.057 0.0397
coefplot(survive_age)# generate predicted values
library(rcfss) # need to convert log-odds to probabilities
titanic_age <- titanic %>%
data_grid(Age) %>%
add_predictions(survive_age) %>%
# predicted values are in the log-odds form - convert to probabilities
mutate(prob = logit2prob(pred))
p_age <- ggplot(titanic_age, aes(Age, prob)) +
geom_line() +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Probability of surviving the Titanic",
subtitle = "Age",
y = "Predicted probability of survival",
color = "Sex")
p_ageBy visualizing the results as predicted probabilities (which are curvilinear and do not lend themselves well to presentation in a table), we have a more intuitive understanding of the results. For instance here, as age increases probability of survival decreases. Alternatively, in odds form:
survive_age_pred <- titanic_age %>%
mutate(odds = prob2odds(prob))
ggplot(survive_age_pred, aes(Age, odds)) +
geom_line(color = "blue", size = 1) +
labs(x = "Age",
y = "Odds of surviving the Titanic")Regardless of age, the odds of surviving the Titanic are always below 1. Considering the probability of even a 1 year old surviving was less than \(.50\), this should be expected. The relationship between age and the odds of survival is still curvilinear.
This is especially true once we introduce additional variables into the model:
survive_age_woman <- glm(Survived ~ Age + Sex, data = titanic,
family = binomial)
titanic_age_sex <- titanic %>%
data_grid(Age, Sex) %>%
add_predictions(survive_age_woman) %>%
mutate(pred = logit2prob(pred))
p_age_sex <- ggplot(titanic_age_sex, aes(Age, pred, color = Sex)) +
geom_line() +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Probability of surviving the Titanic",
subtitle = "Age + Sex",
y = "Predicted probability of survival",
color = "Sex")
p_age_sexsurvive_age_woman_x <- glm(Survived ~ Age * Sex, data = titanic,
family = binomial)
titanic_age_sex_x <- titanic %>%
data_grid(Age, Sex) %>%
add_predictions(survive_age_woman_x) %>%
mutate(pred = logit2prob(pred))
p_age_sex_x <- ggplot(titanic_age_sex_x, aes(Age, pred, color = Sex)) +
geom_line() +
scale_y_continuous(limits = c(0, 1)) +
labs(title = "Probability of surviving the Titanic",
subtitle = "Age x Sex",
y = "Predicted probability of survival",
color = "Sex")
p_age_sex_xIf we want to present the results simultaneously, we have a few options:
# in a table
stargazer(survive_age, survive_age_woman, survive_age_woman_x,
type = "html",
title = "Probability of surviving the Titanic",
covariate.labels = c("Age", "Male", "Age x Male", "Constant"),
dep.var.caption = "",
dep.var.labels.include = FALSE,
align = TRUE)| (1) | (2) | (3) | |
| Age | -0.011** | -0.005 | 0.020* |
| (0.005) | (0.006) | (0.011) | |
| Male | -2.470*** | -1.320*** | |
| (0.185) | (0.408) | ||
| Age x Male | -0.041*** | ||
| (0.014) | |||
| Constant | -0.057 | 1.280*** | 0.594* |
| (0.174) | (0.230) | (0.310) | |
| Observations | 714 | 714 | 714 |
| Log Likelihood | -480.000 | -375.000 | -370.000 |
| Akaike Inf. Crit. | 964.000 | 756.000 | 748.000 |
| Note: | p<0.1; p<0.05; p<0.01 | ||
# in a coefficient plot
multiplot(survive_age, survive_age_woman, survive_age_woman_x,
title = "Probability of surviving the Titanic",
xlab = "Parameter (log-odds)",
ylab = NULL,
newNames = c("Sexmale" = "Male",
"Age:Sexmale" = "Age x Male")) +
scale_color_discrete(labels = c("Age", "Age + Sex", "Age x Sex"))# in a predicted probability plot
bind_rows(list("Age" = titanic_age %>%
mutate(pred = prob),
"Age + Sex" = titanic_age_sex,
"Age x Sex" = titanic_age_sex_x), .id = "id") %>%
mutate(Sex = ifelse(is.na(Sex), "both", Sex)) %>%
# plot the two models
ggplot(aes(Age, pred, color = Sex, linetype = id)) +
geom_line() +
scale_y_continuous(limits = c(0, 1)) +
scale_color_manual(values = c("#1b9e77", "#d95f02", "#7570b3")) +
labs(title = "Probability of surviving the Titanic",
x = "Age",
y = "Predicted probability of survival",
color = "Sex",
linetype = "Model")devtools::session_info()## setting value
## version R version 3.3.3 (2017-03-06)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2017-04-26
##
## package * version date
## assertthat 0.1 2013-12-06
## backports 1.0.5 2017-01-18
## broom * 0.4.2 2017-02-13
## codetools 0.2-15 2016-10-05
## coefplot * 1.2.4.9000 2017-04-25
## colorspace 1.3-2 2016-12-14
## DBI 0.6 2017-03-09
## descr * 1.1.3 2016-05-11
## devtools 1.12.0 2016-06-24
## digest 0.6.12 2017-01-27
## dplyr * 0.5.0 2016-06-24
## evaluate 0.10 2016-10-11
## forcats * 0.2.0 2017-01-23
## foreign 0.8-67 2016-09-13
## ggplot2 * 2.2.1 2016-12-30
## gtable 0.2.0 2016-02-26
## haven * 1.0.0 2016-09-23
## hms 0.3 2016-11-22
## htmltools 0.3.5 2016-03-21
## httr 1.2.1 2016-07-03
## ISLR * 1.0 2013-06-11
## jsonlite 1.3 2017-02-28
## knitr * 1.15.1 2016-11-22
## labeling 0.3 2014-08-23
## lattice 0.20-34 2016-09-06
## lazyeval 0.2.0 2016-06-12
## lmtest 0.9-35 2017-02-11
## lubridate 1.6.0 2016-09-13
## magrittr 1.5 2014-11-22
## MASS 7.3-45 2016-04-21
## Matrix 1.2-8 2017-01-20
## MatrixModels * 0.4-1 2015-08-22
## memoise 1.0.0 2016-01-29
## mnormt 1.5-5 2016-10-15
## modelr * 0.1.0 2016-08-31
## munsell 0.4.3 2016-02-13
## nlme 3.1-131 2017-02-06
## plyr 1.8.4 2016-06-08
## productplots * 0.1.1.9000 2017-04-25
## psych 1.7.3.21 2017-03-22
## purrr * 0.2.2 2016-06-18
## quantreg * 5.29 2016-09-04
## R6 2.2.0 2016-10-05
## rcfss * 0.1.4 2017-02-28
## Rcpp 0.12.10 2017-03-19
## readr * 1.1.0 2017-03-22
## readxl 0.1.1 2016-03-28
## reshape2 1.4.2 2016-10-22
## rmarkdown 1.3 2016-12-21
## rprojroot 1.2 2017-01-16
## rvest 0.3.2 2016-06-17
## scales 0.4.1 2016-11-09
## SparseM * 1.74 2016-11-10
## stargazer * 5.2 2015-07-14
## stringi 1.1.2 2016-10-01
## stringr * 1.2.0 2017-02-18
## tibble * 1.2 2016-08-26
## tidyr * 0.6.1 2017-01-10
## tidyverse * 1.1.1 2017-01-27
## titanic * 0.1.0 2015-08-31
## useful 1.2.1 2016-06-29
## vcd * 1.4-3 2016-09-17
## withr 1.0.2 2016-06-20
## xml2 1.1.1 2017-01-24
## xtable 1.8-2 2016-02-05
## yaml 2.1.14 2016-11-12
## zoo 1.7-14 2016-12-16
## source
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.3)
## Github (jaredlander/coefplot@0755a00)
## CRAN (R 3.3.2)
## CRAN (R 3.3.3)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## CRAN (R 3.3.3)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## cran (@1.0.0)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## cran (@1.15.1)
## CRAN (R 3.3.0)
## CRAN (R 3.3.3)
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.3)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.3)
## CRAN (R 3.3.0)
## Github (hadley/productplots@391f500)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## local
## cran (@0.12.10)
## cran (@1.1.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.3.1)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## cran (@1.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.0)
## CRAN (R 3.3.2)
## CRAN (R 3.3.0)
## cran (@2.1.14)
## CRAN (R 3.3.2)